Add M nearest-neighbour Chatterjee correlation (#990) by su-senka · Pull Request #1414 · boostorg/math

su-senka · 2026-07-01T06:04:55Z

Summary

Implements the revised (M nearest-neighbour) Chatterjee rank correlation of
Lin & Han (2021), addressing #990. Adds a new function
chatterjee_correlation_mnn(u, v, M) alongside the existing
chatterjee_correlation, with the same C++11 and C++17 overload structure.

The original coefficient has a detection boundary of n^(-1/4) for independence
testing, well short of the parametric n^(-1/2) rate. By using the M right
nearest neighbours of each point (rather than the single right neighbour) and
letting M grow with n, the revised statistic consistently estimates the same
dependence measure while approaching near-parametric efficiency. See Lin & Han,
"On boosting the power of Chatterjee's rank correlation", Biometrika 110(2)
(2023) 283–299, arXiv:2108.06828 (first posted 2021).
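
A minimal sequential sketch of the statistic described above may help readers without the paper to hand. The formula in the comment is the Lin & Han definition as understood from the paper, and the handling of points with fewer than M right neighbours (j = i) is likewise an assumption taken from the paper, not from this patch. The name xi_mnn_sketch is hypothetical and is not the function added in this PR; the sketch assumes no ties in x or y.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch, NOT the Boost.Math implementation.
// Assumed formula (Lin & Han), with 1-based ranks R of y taken in x-sorted order:
//   xi_{n,M} = -2 + 6 * sum_i sum_{m=1..M} min(R_i, R_{j_m(i)})
//                   / ((n + 1) * (n*M + M*(M + 1)/4))
// where j_m(i) is the m-th right neighbour of i, or i itself if none exists.
double xi_mnn_sketch(const std::vector<double>& x,
                     const std::vector<double>& y,
                     std::size_t M)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    assert(n > 0 && M >= 1);

    // order[k] is the index of the k-th smallest x.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&x](std::size_t a, std::size_t b) { return x[a] < x[b]; });

    // rank[k] is the 1-based rank of the y paired with the k-th smallest x.
    std::vector<std::size_t> by_y(n);
    std::iota(by_y.begin(), by_y.end(), std::size_t{0});
    std::sort(by_y.begin(), by_y.end(),
              [&](std::size_t a, std::size_t b) { return y[order[a]] < y[order[b]]; });
    std::vector<std::size_t> rank(n);
    for (std::size_t r = 0; r < n; ++r) {
        rank[by_y[r]] = r + 1;
    }

    // O(n*M) double loop over each point and its M right neighbours.
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t m = 1; m <= M; ++m) {
            const std::size_t j = (i + m < n) ? i + m : i;
            sum += static_cast<double>(std::min(rank[i], rank[j]));
        }
    }

    const double nd = static_cast<double>(n);
    const double Md = static_cast<double>(M);
    return -2.0 + 6.0 * sum / ((nd + 1.0) * (nd * Md + Md * (Md + 1.0) / 4.0));
}
```

Under the assumed formula, five strictly increasing points with M = 2 give 14/23, which shows that the finite-sample value sits below 1 even for perfect monotone dependence.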

Design notes

- Separate function rather than an extended signature. The M-NN statistic
  uses min(R_i, R_j) and a different normalisation, so even at M = 1 it is not
  identical to chatterjee_correlation. A distinct function avoids silently
  changing existing results and keeps the statistical intent explicit.
- M is a required argument with no default.
- Rank base. The internal rank() returns 0-based ranks; the paper's
  formula uses 1-based ranks. The offset cancels in the existing M = 1 statistic
  (which uses |R_i - R_{i+1}|) but not under min(.,.), so it is applied
  explicitly. This is noted in a comment where it matters.
- Complexity. O(n log n + nM). Near-linear for small M; tends to O(n²) as
  M → n.
- Parallel path. The outer index loop is partitioned across threads into
  disjoint ranges, each reading the shared rank vector read-only (indices up to
  i + M may fall in a neighbouring range; there are no writes). This differs
  from the M = 1 parallel path, which splits the data array for the
  difference-based transform.
- Ties / degenerate input. Like chatterjee_correlation, the function
  assumes distinct Y (continuous data). A constant Y returns a quiet NaN; this
  is detected on the input directly, since rank() collapses tied values.
- Choice of M. The asymptotic null variance is minimised at M ~ sqrt(n); the
  choice is documented but left to the caller.
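
The rank-base and parallel-path notes can be illustrated with a small sketch. The helper min_rank_sum is hypothetical and is not code from this patch; it takes 0-based ranks in X-sorted order, adds the +1 explicitly, and sums over an index range [begin, end). The j = i convention for a missing neighbour is an assumption taken from the paper. The ranges are evaluated one after another here; because the rank vector is only read, handing each range to its own thread should give the same total.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, NOT the Boost.Math implementation.
// Sums min(R_i, R_j) over the M right neighbours j of every i in [begin, end),
// where rank0 holds 0-based ranks and R = rank0 + 1 is the 1-based rank.
std::size_t min_rank_sum(const std::vector<std::size_t>& rank0,
                         std::size_t begin, std::size_t end, std::size_t M)
{
    const std::size_t n = rank0.size();
    assert(begin <= end && end <= n);

    std::size_t sum = 0;
    for (std::size_t i = begin; i < end; ++i) {
        for (std::size_t m = 1; m <= M; ++m) {
            // j may lie beyond `end`, i.e. in a neighbouring range; it is only read.
            const std::size_t j = (i + m < n) ? i + m : i;
            // min(a + 1, b + 1) == min(a, b) + 1, so the offset does not cancel:
            // every term grows by one relative to the 0-based sum.
            sum += std::min(rank0[i], rank0[j]) + 1;
        }
    }
    return sum;
}
```

Splitting [0, n) into disjoint ranges and adding the partial results reproduces the full sum, and the 1-based total exceeds the 0-based one by exactly n·M, which is why the offset has to be applied explicitly.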

Tests

Added to test_chatterjee_correlation.cpp, covering float, double, and
long double:

- Exact closed-form checks against the paper's Remark 2.5 (strictly increasing
  and strictly decreasing dependence), which require no external reference.
- Small exact spot values computed independently as rationals.
- Constant-Y → NaN, and invariance under strictly increasing transforms of X
  and Y.
- Sequential/parallel agreement across several M (under the parallel build).
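
As a rough illustration of why the invariance and constant-Y checks work, the sketch below uses hypothetical stand-ins (ranks0, is_constant, check_monotone_invariance), none of which are the helpers or tests in this patch. A strictly increasing transform leaves the ranks unchanged, so any statistic built from ranks alone is unchanged too; a constant input is detected on the raw values rather than on the ranks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a rank helper: 0-based ranks, assuming no ties.
std::vector<std::size_t> ranks0(const std::vector<double>& v)
{
    std::vector<std::size_t> idx(v.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::sort(idx.begin(), idx.end(),
              [&v](std::size_t a, std::size_t b) { return v[a] < v[b]; });
    std::vector<std::size_t> r(v.size());
    for (std::size_t k = 0; k < idx.size(); ++k) {
        r[idx[k]] = k;
    }
    return r;
}

// True when no two adjacent values differ, i.e. the input is constant
// (an empty or single-element input also counts as constant here).
bool is_constant(const std::vector<double>& v)
{
    return std::adjacent_find(v.begin(), v.end(),
                              std::not_equal_to<double>()) == v.end();
}

// Sketch of an invariance check: v -> v^3 + 1 is strictly increasing,
// so the ranks before and after the transform should agree.
void check_monotone_invariance(const std::vector<double>& y)
{
    std::vector<double> t;
    t.reserve(y.size());
    for (double v : y) {
        t.push_back(v * v * v + 1.0);
    }
    assert(ranks0(t) == ranks0(y));
}
```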

The sequential path was verified locally under b2 with cxxstd=14 and
cxxstd=17 (clang, arm64).

mborland

I've approved the workflow. This all looks good to me! @NAThompson if you have time could you give this a quick look?

codecov · 2026-07-02T13:56:45Z

Codecov Report

❌ Patch coverage is 99.30070% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 95.40%. Comparing base (0cfea2f) to head (2709fb2).
⚠️ Report is 8 commits behind head on develop.

Files with missing lines	Patch %	Lines
...e/boost/math/statistics/chatterjee_correlation.hpp	98.36%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1414      +/-   ##
===========================================
+ Coverage    95.39%   95.40%   +0.01%     
===========================================
  Files          826      827       +1     
  Lines        68919    69062     +143     
===========================================
+ Hits         65747    65891     +144     
+ Misses        3172     3171       -1

Files with missing lines	Coverage Δ
test/test_chatterjee_correlation.cpp	`100.00% <100.00%> (ø)`
...e/boost/math/statistics/chatterjee_correlation.hpp	`97.05% <98.36%> (+1.93%)`	⬆️

... and 3 files with indirect coverage changes

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mborland · 2026-07-02T19:06:41Z

macOS failure has already been fixed on develop. Merging. Thank you for this contribution!

Add M nearest-neighbour Chatterjee correlation (boostorg#990)

2709fb2

mborland approved these changes Jul 2, 2026

mborland linked an issue Jul 2, 2026 that may be closed by this pull request

Can Chatterjee Correlation be improved? #990

Closed

mborland merged commit 8ee12a5 into boostorg:develop Jul 2, 2026
73 of 74 checks passed

mborland merged 1 commit into boostorg:develop from su-senka:feature/chatterjee-mnn
